Clean up data

Overview

In this homework assignment, we will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. We will build multiple linear regression on the continuous variable TARGET_AMT and binary logistic regression on the TARGET_FLAG using the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

We are going to build several models using different techniques and variable selection. In order to best assess our predictive model, we created a test set within our training data, and split it along an 80/20 training/testing proportion, before applying the finalized models to a separate evaluation dataset that did not contain the target.

1. Data Exploration

The insurance training dataset contains 8161 observations of 26 variables, each record represents a customer at an auto insurance company. The evaluation dataset contains 2141 observations of 26 variables. The descriptions of each column are below.

Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

1.1 Summary Statistics

The training data can be previewed below. The TARGET_FLAG column is the binary dependent variable denoting if a car was in a crash (target = 1) or not (target = 0). TARGET_AMT is a numeric dependent variable and represents the amount of time the car spent on repairs in case of crash. The minimum is 0 (car wasn’t in crash, no time spent on repairs), the maximum is 107586.1.

The table below provides valuable descriptive statistics about the training data. 14 variables are categorical, 12 variables are numeric. There is no missing data for categorical variables while numeric variables YOJ (years on job) has 5% of missing data, CAR_AGE (vehicle age) has 6%, and AGE (age of driver) has less than 1%. Most of the numeric variables have a minimum of zero. Some numbers seem strange, we should deal with it later. For example, CAR_AGE has the minimum value of -3. Some of the variables are character though they should be numeric and vice versa. Variable OLDCLAIM, BLUEBOOK, HOME_VAL, INCOME have $ sign in front a number, we should remove the sign and transform the variable to numeric. Variables MSTATUS (Marital Status), EDUCATION, JOB, CAR_TYPE, SEX AND URBANICITY contain prefix z_ that should be removed as well. ADD SOMETHING ELSE FOR SUMMARY

Data summary
Name train_df
Number of rows 8161
Number of columns 25
_______________________
Column type frequency:
character 10
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PARENT1 0 1 2 3 0 2 0
MSTATUS 0 1 2 3 0 2 0
SEX 0 1 1 1 0 2 0
EDUCATION 0 1 3 12 0 5 0
JOB 0 1 6 12 0 9 0
CAR_USE 0 1 7 10 0 2 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 19 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0
INCOME 445 0.95 61898.09 47572.68 0 28097 54028 85986 367030.0
HOME_VAL 464 0.94 154867.29 129123.77 0 0 161160 238724 885282.0
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0
BLUEBOOK 0 1.00 15709.90 8419.73 1500 9280 14440 20850 69740.0
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0
OLDCLAIM 0 1.00 4037.08 8777.14 0 0 0 4636 57037.0
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0

1.2 Distributions

Before building a model, we need to make sure that we have both classes equally presented in our TARGET_FLAG variable. Class 1 takes 27% and class 0 takes 63% of the target variable. As a result, we have unbalanced class distribution for our target variable that we have to deal with, we have to take some additional steps (bootstrapping, etc) before using logistic regression.
Distribution of Target Flag
Value %
0 0.74
1 0.26

The distribution of the second target variable TARGET_AMT is right skewed, we will also transform the variable to make it follow the normal distribution (log/BoxCox).

1.3 Box Plots

1.4 Scatter Plot

## `geom_smooth()` using formula = 'y ~ x'

1.5 Correlation Matrix

2. Data preparation

2.1 Data types

First, we will remove prefixes z_ and $ together with the INDEX variable (identification Variable).

Transform to factor variables TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY.

Transform to numeric variables INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM.

As we see below, there are no unwanted characters in the factor variables. The JOB variable contains empty level "", it was substituted with "Unknown". RED_CAR levels will be “Yes/No” instead of “yes/no”.

## $TARGET_FLAG
## [1] "0" "1"
## 
## $PARENT1
## [1] "No"  "Yes"
## 
## $CAR_TYPE
## [1] "Minivan"     "Panel Truck" "Pickup"      "Sports Car"  "SUV"        
## [6] "Van"        
## 
## $JOB
## [1] "Blue Collar"  "Clerical"     "Doctor"       "Home Maker"   "Lawyer"      
## [6] "Manager"      "Professional" "Student"      "Unknown"     
## 
## $CAR_USE
## [1] "Commercial" "Private"   
## 
## $URBANICITY
## [1] "Highly Rural/ Rural" "Highly Urban/ Urban"
## 
## $RED_CAR
## [1] "No"  "Yes"
## 
## $REVOKED
## [1] "No"  "Yes"
## 
## $MSTATUS
## [1] "No"  "Yes"
## 
## $EDUCATION
## [1] "<High School" "Bachelors"    "High School"  "Masters"      "PhD"         
## 
## $SEX
## [1] "F" "M"

2.3 Transformations and Missing Values

First, we’ll manually adjust two special cases of missing or outlier values.

In cases where YOJ is zero and INCOME is NA, we’ll set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.

##           JOB
## 1  Home Maker
## 2     Student
## 36    Unknown

There is also at least one value of CAR_AGE that is less than zero - we’ll assume this is a data collection error and set it to zero (representing a brand-new car.)

We’ll use MICE to impute values for our remaining variables with missing values - AGE, YOJ, CAR_AGE, INCOME and HOME_VALUE. We might reasonably assume there are relationships between them (older, more years on the job may correlate with higher income and home value). Taking simple means or medians might suppress those features, but MICE should provide a better imputation.

Log transformation will be applied to variables INCOME, TARGET_AMT, OLDCLAIM to transform their distributions from right-skewed to the normal.

BoxCox transformation will be applied to variables BLUEBOOK, TRAVTIME, TIF, so they follow the normal distribution.

Binning values for CAR_AGE, HOME_VAL and TIF:

Creating dummy variables for factors with two levels:

2.4 Visualizations

##  [1] "TARGET_FLAG"   "TARGET_AMT"    "KIDSDRIV"      "AGE"          
##  [5] "HOMEKIDS"      "YOJ"           "INCOME"        "HOME_VAL"     
##  [9] "EDUCATION"     "JOB"           "TRAVTIME"      "BLUEBOOK"     
## [13] "TIF"           "CAR_TYPE"      "OLDCLAIM"      "CLM_FREQ"     
## [17] "MVR_PTS"       "CAR_AGE"       "CAR_AGE_BIN"   "HOME_VAL_BIN" 
## [21] "TIF_BIN"       "MALE"          "MARRIED"       "LIC_REVOKED"  
## [25] "CAR_RED"       "PRIVATE_USE"   "SINGLE_PARENT" "URBAN"        
## [29] "CAR_CRASH"
## Warning in data.table::melt(.): The melt generic in data.table has been passed
## a data.frame and will attempt to redirect to the relevant reshape2 method;
## please note that reshape2 is deprecated, and this redirection is now deprecated
## as well. To continue using melt methods from reshape2 while both libraries are
## attached, e.g. melt.list, you can prepend the namespace like reshape2::melt(.).
## In the next version, this warning will become an error.
## No id variables; using all as measure variables
## Warning: attributes are not identical across measure variables; they will be
## dropped

2.x Training and Validation Sets

(Move this split to after all transforms)

3. Multiple linear regression

3.1 Model 1 - Lasso

## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.450670e+00
## KIDSDRIV              8.242256e-01
## AGE                  -1.590170e-03
## HOMEKIDS              2.966269e-02
## YOJ                  -1.843188e-02
## INCOME               -6.032096e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -6.083644e-01
## EDUCATIONHigh School  6.004131e-02
## EDUCATIONMasters     -4.377078e-01
## EDUCATIONPhD         -1.234115e-01
## JOBClerical           1.422133e-01
## JOBDoctor            -9.728426e-01
## JOBHome Maker         .           
## JOBLawyer            -1.137389e-01
## JOBManager           -1.529192e+00
## JOBProfessional      -1.013136e-01
## JOBStudent           -1.994015e-01
## JOBUnknown           -6.904978e-01
## TRAVTIME              6.940606e-02
## BLUEBOOK             -7.084013e-03
## TIF                  -1.700780e-01
## CAR_TYPEPanel Truck   6.532344e-01
## CAR_TYPEPickup        7.223041e-01
## CAR_TYPESports Car    1.591797e+00
## CAR_TYPESUV           1.135741e+00
## CAR_TYPEVan           7.850491e-01
## OLDCLAIM              4.862112e-02
## CLM_FREQ              7.523577e-02
## MVR_PTS               2.709291e-01
## CAR_AGE              -7.336134e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -3.339554e-01
## TIF_BIN              -1.534265e-01
## MALE1                 4.100768e-02
## MARRIED1             -8.001639e-01
## LIC_REVOKED1          1.591403e+00
## CAR_RED1              3.091972e-02
## PRIVATE_USE1         -1.677192e+00
## SINGLE_PARENT1        1.042326e+00
## URBAN1                3.650730e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.123592e+00
## KIDSDRIV              5.781403e-01
## AGE                   .           
## HOMEKIDS              .           
## YOJ                   .           
## INCOME               -6.145362e-06
## HOME_VAL              .           
## EDUCATIONBachelors    .           
## EDUCATIONHigh School  2.565739e-01
## EDUCATIONMasters      .           
## EDUCATIONPhD          .           
## JOBClerical           .           
## JOBDoctor             .           
## JOBHome Maker         .           
## JOBLawyer             .           
## JOBManager           -1.060109e+00
## JOBProfessional       .           
## JOBStudent            .           
## JOBUnknown            .           
## TRAVTIME              4.069176e-02
## BLUEBOOK             -5.715392e-03
## TIF                  -1.354175e-01
## CAR_TYPEPanel Truck   .           
## CAR_TYPEPickup        .           
## CAR_TYPESports Car    5.614647e-01
## CAR_TYPESUV           2.868858e-01
## CAR_TYPEVan           .           
## OLDCLAIM              6.328528e-02
## CLM_FREQ              1.384936e-03
## MVR_PTS               2.502005e-01
## CAR_AGE              -1.463889e-02
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -3.804650e-01
## TIF_BIN              -6.383520e-02
## MALE1                 .           
## MARRIED1             -4.457439e-01
## LIC_REVOKED1          1.321738e+00
## CAR_RED1              .           
## PRIVATE_USE1         -1.384149e+00
## SINGLE_PARENT1        1.010721e+00
## URBAN1                2.952228e+00
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "gaussian", standardize = TRUE,      alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure     SE Nonzero
## min 0.01331    51   25.33 0.4037      37
## 1se 0.16407    24   25.73 0.4060      20

## $mse
## lambda.1se 
##   25.50668 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.1se 
##   4.166593 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -48241.06
## 
## $BIC
## [1] -47990.49
## $AICc
## [1] -44758.86
## 
## $BIC
## [1] -44623.31
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.450670e+00
## KIDSDRIV              8.242256e-01
## AGE                  -1.590170e-03
## HOMEKIDS              2.966269e-02
## YOJ                  -1.843188e-02
## INCOME               -6.032096e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -6.083644e-01
## EDUCATIONHigh School  6.004131e-02
## EDUCATIONMasters     -4.377078e-01
## EDUCATIONPhD         -1.234115e-01
## JOBClerical           1.422133e-01
## JOBDoctor            -9.728426e-01
## JOBHome Maker         .           
## JOBLawyer            -1.137389e-01
## JOBManager           -1.529192e+00
## JOBProfessional      -1.013136e-01
## JOBStudent           -1.994015e-01
## JOBUnknown           -6.904978e-01
## TRAVTIME              6.940606e-02
## BLUEBOOK             -7.084013e-03
## TIF                  -1.700780e-01
## CAR_TYPEPanel Truck   6.532344e-01
## CAR_TYPEPickup        7.223041e-01
## CAR_TYPESports Car    1.591797e+00
## CAR_TYPESUV           1.135741e+00
## CAR_TYPEVan           7.850491e-01
## OLDCLAIM              4.862112e-02
## CLM_FREQ              7.523577e-02
## MVR_PTS               2.709291e-01
## CAR_AGE              -7.336134e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -3.339554e-01
## TIF_BIN              -1.534265e-01
## MALE1                 4.100768e-02
## MARRIED1             -8.001639e-01
## LIC_REVOKED1          1.591403e+00
## CAR_RED1              3.091972e-02
## PRIVATE_USE1         -1.677192e+00
## SINGLE_PARENT1        1.042326e+00
## URBAN1                3.650730e+00

Model Performance

3.2 Model 2 -

3.3 Model 3 -

3.4 Model selection

4. Binary logistic regression

4.1 Model 1 - Lasso

## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.716098e+00
## KIDSDRIV              4.114268e-01
## AGE                  -1.274043e-03
## HOMEKIDS              2.779506e-02
## YOJ                  -5.912864e-03
## INCOME               -3.964779e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -3.129844e-01
## EDUCATIONHigh School  1.563180e-02
## EDUCATIONMasters     -2.467550e-01
## EDUCATIONPhD         -4.260311e-02
## JOBClerical           6.781448e-02
## JOBDoctor            -4.680465e-01
## JOBHome Maker        -2.933138e-02
## JOBLawyer             .           
## JOBManager           -7.506532e-01
## JOBProfessional      -3.377667e-02
## JOBStudent           -1.395579e-01
## JOBUnknown           -2.049771e-01
## TRAVTIME              3.869096e-02
## BLUEBOOK             -4.230398e-03
## TIF                  -5.004348e-02
## CAR_TYPEPanel Truck   4.776375e-01
## CAR_TYPEPickup        4.281607e-01
## CAR_TYPESports Car    8.755467e-01
## CAR_TYPESUV           6.355374e-01
## CAR_TYPEVan           5.122177e-01
## OLDCLAIM              2.504929e-02
## CLM_FREQ              3.513050e-02
## MVR_PTS               1.068372e-01
## CAR_AGE              -2.727552e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -1.660078e-01
## TIF_BIN              -1.261495e-01
## MALE1                 5.464939e-03
## MARRIED1             -4.532368e-01
## LIC_REVOKED1          6.960323e-01
## CAR_RED1              3.006562e-02
## PRIVATE_USE1         -8.083996e-01
## SINGLE_PARENT1        3.882197e-01
## URBAN1                2.285270e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.464448e+00
## KIDSDRIV              3.295352e-01
## AGE                  -1.769623e-04
## HOMEKIDS              1.048500e-02
## YOJ                   .           
## INCOME               -3.680548e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -1.070104e-01
## EDUCATIONHigh School  7.383992e-02
## EDUCATIONMasters     -3.456691e-02
## EDUCATIONPhD          .           
## JOBClerical           3.677502e-02
## JOBDoctor            -7.665609e-02
## JOBHome Maker         .           
## JOBLawyer             .           
## JOBManager           -5.667630e-01
## JOBProfessional       .           
## JOBStudent            .           
## JOBUnknown            .           
## TRAVTIME              2.797189e-02
## BLUEBOOK             -3.228510e-03
## TIF                  -5.098952e-02
## CAR_TYPEPanel Truck   .           
## CAR_TYPEPickup        6.408488e-02
## CAR_TYPESports Car    4.575649e-01
## CAR_TYPESUV           2.774664e-01
## CAR_TYPEVan           3.657761e-02
## OLDCLAIM              2.599506e-02
## CLM_FREQ              1.974628e-02
## MVR_PTS               9.972623e-02
## CAR_AGE              -7.803403e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -1.642793e-01
## TIF_BIN              -8.078956e-02
## MALE1                 .           
## MARRIED1             -3.288395e-01
## LIC_REVOKED1          6.046503e-01
## CAR_RED1              .           
## PRIVATE_USE1         -7.491766e-01
## SINGLE_PARENT1        3.783721e-01
## URBAN1                1.913118e+00
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit",      standardize = TRUE, alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.001033    51  0.9039 0.010248      37
## 1se 0.007285    30  0.9132 0.009009      28

## $deviance
## lambda.1se 
##  0.9055072 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.1se 
##  0.2133885 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.810068
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.1se 
##  0.2944212 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.1se 
##  0.6091211 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1628.926
## 
## $BIC
## [1] -1378.357
## $AICc
## [1] -1565.658
## 
## $BIC
## [1] -1375.96
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -1.716098e+00
## KIDSDRIV              4.114268e-01
## AGE                  -1.274043e-03
## HOMEKIDS              2.779506e-02
## YOJ                  -5.912864e-03
## INCOME               -3.964779e-06
## HOME_VAL              .           
## EDUCATIONBachelors   -3.129844e-01
## EDUCATIONHigh School  1.563180e-02
## EDUCATIONMasters     -2.467550e-01
## EDUCATIONPhD         -4.260311e-02
## JOBClerical           6.781448e-02
## JOBDoctor            -4.680465e-01
## JOBHome Maker        -2.933138e-02
## JOBLawyer             .           
## JOBManager           -7.506532e-01
## JOBProfessional      -3.377667e-02
## JOBStudent           -1.395579e-01
## JOBUnknown           -2.049771e-01
## TRAVTIME              3.869096e-02
## BLUEBOOK             -4.230398e-03
## TIF                  -5.004348e-02
## CAR_TYPEPanel Truck   4.776375e-01
## CAR_TYPEPickup        4.281607e-01
## CAR_TYPESports Car    8.755467e-01
## CAR_TYPESUV           6.355374e-01
## CAR_TYPEVan           5.122177e-01
## OLDCLAIM              2.504929e-02
## CLM_FREQ              3.513050e-02
## MVR_PTS               1.068372e-01
## CAR_AGE              -2.727552e-03
## CAR_AGE_BIN           .           
## HOME_VAL_BIN         -1.660078e-01
## TIF_BIN              -1.261495e-01
## MALE1                 5.464939e-03
## MARRIED1             -4.532368e-01
## LIC_REVOKED1          6.960323e-01
## CAR_RED1              3.006562e-02
## PRIVATE_USE1         -8.083996e-01
## SINGLE_PARENT1        3.882197e-01
## URBAN1                2.285270e+00

Model Performance

##          True
## Predicted    0   1 Total
##     0     1102 244  1346
##     1      100 187   287
##     Total 1202 431  1633
## 
##  Percent Correct:  0.7893

4.2 Model 2 -

4.3 Model 3 -

4.4 Model selection

5. Predictions

6. Conclusion

7. References

Appendix: R code